Method and system for estimating gaze target, gaze sequence, and gaze map from video

ABSTRACT

The present invention is a method and system to estimate the visual target that people are looking at, based on automatic image measurements. The system utilizes image measurements from both face-view cameras and top-down view cameras. The cameras are calibrated with respect to the site and the visual target, so that the gaze target is determined from the estimated position and gaze direction of a person. Face detection and two-dimensional pose estimation locate and normalize the face of the person so that the eyes can be accurately localized and the three-dimensional facial pose can be estimated. The eye gaze is estimated based on either the positions of the localized eyes and irises or on the eye image itself, depending on the quality of the image. The gaze direction is estimated from the eye gaze measurement in the context of the three-dimensional facial pose. From the top-down view, the body of the person is detected and tracked, so that the position of the head is estimated using a body blob model that depends on the body position in the view. The gaze target is determined based on the estimated gaze direction, the estimated head position, and the camera calibration. The gaze target estimation can provide a gaze trajectory of the person or a collective gaze map from many instances of gaze.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention is a method and system to estimate the visual target that people are looking at, based on automatic image measurements.

2. Background of the Invention

A person's interest can often be revealed by observing where he or she is looking. In certain environments, how often a certain visual target receives attention can provide valuable information of high commercial significance. For example, more frequent visual attention from shoppers toward certain products in a retail space could result in higher sales of those products. The gaze change of a shopper can reveal his or her mental process, such as a change of interest toward products. Furthermore, if one can estimate a "gaze map" of a shelf space, that is, how often each spatial region within a visual target receives visual attention, the data can provide very valuable information to retailers and marketers. The collected statistics can be utilized by themselves to assess the commercial viability of a certain region in shelf space, which can then translate into the degree of success of certain products, product placement, packaging, or promotions. The information can also be analyzed in relation to other information, such as purchase data or shopper-product interaction data. For the advertisement industry, the information regarding how much each region in an advertisement display receives attention can be used to measure its effectiveness and improve its design accordingly. There can be other potential applications, such as human psychology/behavior research or interface design, that involve measurements of gaze.

The present invention introduces a comprehensive system and method to estimate a gaze target within a potential visual target. A visual target is an object that people view, where the way people look at it (the spatial trajectory, the duration, or the frequency of gaze) carries some significance to a given application. The gaze target, the location within the visual target where the person's gaze is fixated, is estimated by measuring the eye gaze of the person as well as the facial pose of the person; the eye gaze of a person is defined as the orientation of the person's gaze relative to the person's face. An automated analysis of the person's image captured from at least one camera provides the measurement for estimating the gaze target. The cameras are placed and oriented so that they can capture the faces of potential viewers of the visual target; the cameras are typically placed near the visual target. The focal lengths of the camera lenses are chosen to capture the faces at a size large enough that the gaze can be accurately estimated. A top-down view camera can be employed to estimate the floor position of the person, which helps to accurately identify the gaze target. The image analysis algorithm should be calibrated based on the camera placement and the geometry of the visual target. The effective resolution of the estimated gaze map is constrained by both the accuracy of the eye gaze estimation algorithm and the distance of the cameras to the person.

Recent developments in computer vision and artificial intelligence technology make it possible to detect and track people's faces and bodies from video sequences. Facial image analysis in particular has matured to the point that faces can be detected and tracked from video images, and the pose of the head and the shapes of the facial features can also be estimated. Notably, the three-dimensional facial pose and eye gaze can be measured to estimate the gaze target. Face detection and tracking handle the problem of locating faces and establishing correspondences among detected faces that belong to the same person. To be able to accurately locate the facial features, the two-dimensional pose (position, size, and orientation) of the face is first estimated. Accurate positions and sizes of facial features are estimated in a similar manner. The estimated positions of the irises relative to the eyes, along with the estimated head orientation, reveal the shopper's gaze direction. However, because different facial poses affect the appearance changes of the eye image due to eye gaze in a nonlinear way, a machine learning-based method is introduced to perform facial pose-dependent gaze direction estimation. The final gaze target is estimated based on the estimated gaze direction and the person position (more specifically, the head position). Because the head position relative to the body changes according to the position in the view, the head position is estimated by employing a view-dependent body blob model.

There have been prior attempts to automatically estimate the gaze direction of a human observer.

U.S. Pat. No. 5,797,046 of Nagano, et al. (hereinafter Nagano) disclosed a visual axis controllable optical apparatus, which is used in different postures. The optical apparatus includes a light detecting device for receiving light reflected by an eye and detecting the intensity distribution of the received light, a storage device for storing personal data associated with a personal difference of the eye in correspondence with the different postures, and a visual axis detecting device for detecting a visual axis. The visual axis detecting device detects the position of the visual axis using the personal data stored in the storage device corresponding to the posture of the optical apparatus, and the intensity distribution detected by the light detecting device.

U.S. Pat. No. 5,818,954 of Tomono, et al. (hereinafter Tomono) disclosed a method that calculates a position of the center of the eyeball as a fixed displacement from an origin of a facial coordinate system established by detection of three points on the face, and computes a vector therefrom to the center of the pupil. The vector and the detected position of the pupil are used to determine the visual axis.

U.S. Pat. No. 6,154,559 of Beardsley (hereinafter Beardsley) disclosed a system that is designed to classify the gaze direction of an individual observing a number of surrounding objects. The system utilizes a qualitative approach in which frequently occurring head poses of the individual are automatically identified and labeled according to their association with the surrounding objects. In conjunction with processing of the eye pose, this enables the classification of gaze direction. In one embodiment, each observed head pose of the individual is automatically associated with a bin in a "pose-space histogram." This histogram records the frequency of different head poses over an extended period of time. Each peak is labeled using a qualitative description of the environment around the individual. The labeled histogram is then used to classify the head pose of the individual in all subsequent images. This head pose processing is augmented with eye pose processing, enabling the system to rapidly classify gaze direction without accurate a priori information about the calibration of the camera utilized to view the individual, without accurate a priori three-dimensional measurements of the geometry of the environment around the individual, and without any need to compute accurate three-dimensional metric measurements of the individual's location, head pose, or eye direction at run-time.

U.S. Pat. No. 6,246,779 of Fukui, et al. (hereinafter Fukui) disclosed a gaze position detection apparatus. A dictionary section previously stores a plurality of dictionary patterns representing a user's image including pupils. An image input section inputs an image including the user's pupils. A feature point extraction section extracts at least one feature point from a face area on the input image. A pattern extraction section geometrically transforms the input image according to a relative position of the feature point on the input image, and extracts a pattern including the user's pupils from the transformed image. A gaze position determination section compares the extracted pattern with the plurality of dictionary patterns, and determines the user's gaze position according to the dictionary pattern matched with the extracted pattern.

U.S. Pat. No. 7,043,056 of Edwards, et al. (hereinafter Edwards) disclosed a method of determining an eye gaze direction of an observer, comprising the steps of: (a) capturing at least one image of the observer and determining a head pose angle of the observer; (b) utilizing the head pose angle to locate an expected eye position of the observer; and (c) analyzing the expected eye position to locate at least one eye of the observer and observing the location of the eye to determine the gaze direction.

U.S. Pat. No. 7,046,924 of Miller, et al. (hereinafter Miller) disclosed a method that is provided for determining an area of importance in an archival image. In accordance with this method, eye information, including eye gaze direction information captured during an image capture sequence for the archival image, is obtained. An area of importance in the archival image is determined based upon the eye information. Area of importance data characterizing the area of importance is associated with the archival image.

U.S. Pat. No. 7,197,165 of Ryan (hereinafter Ryan) disclosed a computer processing apparatus, where frames of image data received from a camera are processed to track the eyes of a user in each image. A three-dimensional computer model of a head is stored, and search regions are defined in the three-dimensional space corresponding to the eyes and eyebrows. For each image, pixels within the projection of the search regions from the three-dimensional space to the two-dimensional image space are sampled to determine a representative intensity value for each of the search regions. Positions for the eyes in the three-dimensional space are then calculated based on the determined values. The three-dimensional computer model and search bands are moved within the three-dimensional space to align the eyes with the calculated eye positions. In this way, when the next image is processed, the search bands project into the image from a head configuration determined from the previous image. This facilitates reliable and accurate eye tracking.

U.S. patent application Ser. No. 10/605,637 of Larsson, et al. (hereinafter Larsson) disclosed a method for analyzing ocular and/or head orientation characteristics of a subject. A detection and quantification of the position of a driver's head and/or eye movements are made relative to the environment. Tests of the data are made, and from the data, locations of experienced areas/objects-of-subject-interest are deduced. When a driver of a vehicle is the subject, these areas/objects-of-driver-interest may be inside or outside the vehicle, and may be constituted by (1) "things" such as audio controls, speedometers, and other gauges, and (2) areas or positions such as "road ahead" and lane-change clearance space in adjacent lanes. In order to "standardize" the tracking data with respect to the vehicle of interest, the quantification of the position of the driver's head is normalized to the reference-base position, thereby enabling deducement of the location(s) where the driver has shown an interest based on sensed information regarding either or both of (1) driver ocular orientation or (2) driver head orientation.

In Nagano, the gaze direction is estimated based on the optical signal of the light reflected by the iris, and on the stored personal signature of the reflection. In Tomono, the measured position of the iris relative to the measured facial coordinate is used to estimate the gaze. In Beardsley, the gaze target is recognized based on the measurement of the head pose and the correlation between a known visual target and the head pose, using the head pose histogram of frequent gaze targets. In Fukui, the gaze is estimated by comparing the measured facial image feature pattern against the stored facial image feature patterns, using neural networks. In Edwards, the eye gaze direction is estimated by first determining the head pose angle and then by locating the iris position relative to the eye region based on a precise geometric model of eyes. In Miller, the eye gaze direction and its path are estimated to identify an area of importance in images. In Ryan, a three-dimensional head model is utilized to estimate the head pose and gaze. The present invention employs basic ideas similar to the mentioned inventions: first estimate the head pose, and then locate the eye and iris positions. The position of the irises relative to the localized eyes provides the data to estimate the gaze direction. However, we adopt a series of machine learning-based approaches to accurately and robustly estimate the gaze under realistic imaging conditions: a two-dimensional facial pose estimation followed by a three-dimensional facial pose estimation, where both estimations utilize multiple learning machines. The facial features are also accurately localized based on the estimated global facial geometry, again using combinations of multiple learning machines, each taking part in localizing a specific facial feature. Each of these machine learning-based estimations of poses or locations utilizes a set of filters specifically designed to extract image features that are relevant to the given estimation problem. Finally, the eye gaze estimates are interpreted differently with varying head pose estimates, to estimate the gaze direction and gaze target. Unlike most of the prior inventions, which focus on close-range visual targets, the present invention aims to estimate gaze regardless of distance, using a series of robust methods for face detection, pose estimation, and eye gaze estimation. To deal with the problem of gaze target estimation from a distance, the position of the head (the starting point of the gaze) is robustly estimated. Due to the varying head position relative to the body, the head position is estimated by employing a view-dependent body blob model.

In summary, the present invention provides robust facial pose estimation and eye gaze estimation by adopting a series of machine learning-based approaches to accurately and robustly estimate the gaze under realistic imaging conditions, without using specialized imaging devices and without requiring close-range images or prior three-dimensional face models. The eye gaze is processed in the context of the varying facial pose, so that the appearance changes of the eyes due to pose changes can be properly handled. The top-down view image analysis for locating the head of the viewer helps to achieve accurate gaze target estimation. The present invention also provides a comprehensive framework for site calibration, performance characterization, and site-specific data collection.

SUMMARY

The present invention is a method and system for automatically identifying the gaze target of a person within a visual target, by measuring the person's head pose and eye gaze.

It is one of the objectives of the first step of the processing to take measurements of the site and the visual target, and to come up with camera specifications and a placement plan. The step will also provide the target grid resolution, as well as the calibration for the face-view cameras and the top-down view camera.

It is one of the objectives of the second step of the processing to detect faces, track them individually, and estimate both the two-dimensional and three-dimensional poses of each of the tracked faces. Given a facial image sequence, the step detects any human faces and keeps each of their individual identities by tracking them. Using learning machines trained from facial pose estimation training, the two-dimensional facial pose estimation step computes the (X, Y) shift, size variation, and orientation of the face inside the face detection window to normalize the facial image, as well as to help the three-dimensional pose estimation. The three-dimensional facial pose estimation computes the yaw (horizontal rotation) and pitch (vertical rotation) angles of the face, after the two-dimensional pose has been normalized.

It is one of the objectives of the third step of the processing to localize the facial features and estimate the eye gaze. The facial feature localization utilizes facial feature localization machines, where multiple learning machines are trained for each facial feature that is already roughly localized based on the estimated two-dimensional facial pose. The eye gaze is estimated based on the deviation of the estimated iris positions from the estimated eye positions.

It is one of the objectives of the fourth step of the processing to estimate the gaze direction of a person, based on the estimated three-dimensional facial pose and the eye gaze. The step computes the gaze direction by finding a three-dimensional facial pose-dependent mapping from the three-dimensional facial pose and the eye gaze to the gaze direction.

It is one of the objectives of the fifth step of the processing to estimate the floor position of the person's head from the top-down view of the person whose facial image is being analyzed for estimating the gaze. First, the body of the person is detected and tracked. Using a body blob model that depends on the position in the top-down view, an accurate head position is estimated.

It is one of the objectives of the sixth step of the processing to estimate the gaze target of the person based on the estimated gaze direction and person position. From the gaze target estimates of the person over time, a gaze trajectory can be constructed. From the gaze direction estimates of a large number of people, a gaze map can be estimated.

DRAWINGS Figures

FIG. 1 is an overall scheme of the system in a preferred embodiment of the invention.

FIG. 2 shows a view of the system of the invention in an operational environment in an exemplary embodiment.

FIG. 3 shows the visual target and the estimated gaze target in a visual target grid in an exemplary embodiment of the present invention.

FIG. 4 shows the site calibration step in an exemplary embodiment of the present invention.

FIG. 5 shows the steps of gaze direction estimation, along with off-line training steps necessary for some of the steps, and a data flow providing an appropriate collection of training data to each of the training steps.

FIG. 6 shows the steps in person position estimation.

FIG. 7 shows one of the features of the site calibration step where the target resolution is being determined.

FIG. 8 shows a series of facial image processing steps, from face detection, to two-dimensional facial pose estimation, and to facial feature localization.

FIG. 9 shows a two-dimensional facial pose estimation training scheme in an exemplary embodiment of the present invention.

FIG. 10 shows an exemplary sampling of (yaw, pitch) ranges for three-dimensional facial pose estimation in an exemplary embodiment of the present invention.

FIG. 11 shows a three-dimensional facial pose estimation training scheme in an exemplary embodiment of the present invention.

FIG. 12 shows a facial feature localization training scheme in an exemplary embodiment of the present invention.

FIG. 13 shows the facial feature localization scheme in an exemplary embodiment of the present invention.

FIG. 14 shows the instances of different eye gaze.

FIG. 15 shows an exemplary embodiment of the eye gaze annotation step.

FIG. 16 shows a function of the three-dimensional facial pose-dependent gaze direction estimation step.

FIG. 17 shows an exemplary scheme of the three-dimensional facial pose-dependent gaze direction estimation step.

FIG. 18 shows an exemplary embodiment of the three-dimensional facial pose-dependent gaze direction estimation step.

FIG. 19 shows the person position estimation and gaze target estimation steps in an exemplary embodiment of the present invention.

FIG. 20 shows the view selection scheme in an exemplary embodiment of the present invention.

FIG. 21 shows an estimated gaze map.

FIG. 22 shows an exemplary embodiment of the weighted voting scheme for gaze map estimation.

FIG. 23 shows an estimated gaze trajectory.

DETAILED DESCRIPTION OF THE INVENTION

FIG. 1 is an overall scheme of the system in a preferred embodiment of the invention. The procedure in the dashed box illustrates off-line processing that is necessary for some of the modules of the system. First, measurements of the site and target geometry 932 are provided, so that the site calibration 935 step can generate the image to floor mapping 939; the mapping is used in the person position estimation 725 step to convert the image coordinates of the tracked people to the world coordinates on the floor. The site calibration 935 step also generates the gaze to target grid mapping 974, so that the gaze direction estimate 952 from the gaze direction estimation 950 step can be interpreted as a gaze target 925 in the gaze target estimation 970 step. The target grid resolution 923 can also be derived from the site calibration 935 step so that the gaze trajectory estimation 987 or gaze map estimation 985 can be performed up to that precision.

FIG. 2 shows a view of the system of the invention in an operational environment in an exemplary embodiment. The first means for capturing images 101 is placed near the visual target 920 to capture the face view 342 of the viewer 705 looking at the visual target 920. The second means for capturing images 102 is placed at a different position so as to capture the top-down view 347 of the body image of the viewer 705. The video feeds from both the first means for capturing images 101 and the second means for capturing images 102 are connected to the control and processing system 162 via means for video interface 115 and processed by the control and processing system 162. The video feed from the first means for capturing images 101 is processed by the control and processing system 162 to estimate the gaze direction 901. The video feed from the second means for capturing images 102 is processed to estimate the position of the viewer 705.

FIG. 3 shows the visual target 920 and the estimated gaze target 925 in the visual target grid 922 in an exemplary embodiment of the present invention. In this embodiment, each square block in the visual target grid 922 represents a location in a shelf space that is of interest. The size of the block represents the target grid resolution 923, and is determined from the site calibration 935 step. Based on the analysis of the images from the face-view camera 110, the gaze direction 901 of each viewer 705 is estimated and is shown as a dotted arrow, and the gaze target 925 is estimated and shown as a black dot.

FIG. 4 shows the site calibration 935 step in an exemplary embodiment of the present invention. Two inputs to the step are the gaze direction estimation error distribution 953 and the measured site and target geometry 932. The gaze direction estimation error distribution 953 represents the distribution of errors in gaze direction, and is provided by an empirical evaluation of the algorithm. The site and target geometry 932 includes the size of the area to be observed by cameras, the distance between the gaze target and the position where the viewers typically stand to watch, the size of the visual target 920, etc. From the site and target geometry 932, the camera specifications (focal length and field-of-view angle) and positions (including the orientations) 936 are determined to cover the areas to be monitored. In typical scenarios, multiple face-view cameras are employed. Then the resulting position of the visual target and the viewers relative to face-view cameras 938 determines the target grid resolution 923 from the gaze direction estimation error distribution 953. Here, a representative position of the viewer 705 is used. The same position information 938 is used to compute the gaze to target grid mapping 974, so that the eye gaze estimate 961 can be interpreted in world coordinates to identify the location of the gaze target 925 within the visual target 920. It is important to interpret the gaze direction in the context of each face-view camera 110 (especially the camera orientation), because the estimated gaze direction is the gaze angle relative to the camera. The camera specifications and positions 936 (especially the camera height) are used to calibrate the top-down view 347; that is, to compute the image to floor mapping 939.

FIG. 5 shows the steps of gaze direction estimation 950, along with off-line training steps necessary for some of the steps, and a data flow providing an appropriate collection of training data to each of the training steps. First, the face view 342 from the face-view camera 110 captures the face of a viewer 705. The face detection 360 step detects the face and estimates its approximate position and size. The face tracking 370 step maintains the identities of the faces so that the faces belonging to the same person are associated. The detected faces, along with their ground-truth positions, sizes, and orientations, constitute the training data for the two-dimensional facial pose estimation training 820. The tracked faces then go through the two-dimensional facial pose estimation 380 step, where the learning machines trained in the two-dimensional facial pose estimation training 820 step are employed. The facial images, whose positions, sizes, and orientations are corrected according to the estimated two-dimensional poses, are fed to both the three-dimensional facial pose estimation training 830 and the facial feature localization training 840. The three-dimensional facial pose estimation training 830 requires the ground-truth yaw and pitch angles of each face, and the facial feature localization training 840 requires the two-dimensional geometry (position, size, and orientation) of each facial feature of the faces. The facial feature localization training 840 is carried out independently for each facial feature. The facial images whose two-dimensional poses have been estimated and corrected are fed to the three-dimensional facial pose estimation 390 step to estimate the yaw and pitch angles of the faces; they are also fed to the facial feature localization 410 step to accurately localize the individual facial features. In particular, the localized and normalized eye images (aligned to the standard position and size) are further used in the eye gaze estimation training 964 step. Then the trained learning machines estimate the eye gaze in the eye gaze estimation 960 step. In an exemplary embodiment where high-resolution facial images are available, the eye gaze is estimated from the difference between the position of the iris (estimated from facial feature localization) and the eye center position (also estimated from facial feature localization). The three-dimensional facial pose estimated from the three-dimensional facial pose estimation 390 step, along with the ground-truth eye gaze data, is used in the three-dimensional facial pose-dependent gaze direction estimation training 957. The trained machines are employed in the three-dimensional facial pose-dependent gaze direction estimation 955 step to estimate the gaze direction. In another embodiment where the resolution of the facial images is low or the images are of low quality, the eye image itself, along with the three-dimensional facial pose, is used directly in the three-dimensional facial pose-dependent gaze direction estimation training 957. In this embodiment, the eye gaze estimation training 964 step and the eye gaze estimation 960 step are skipped. In another embodiment, the system switches between two modes depending on the distance of the face from the camera. The first mode is the explicit eye gaze estimation 960 step followed by the three-dimensional facial pose-dependent gaze direction estimation 955 step.
The second mode is where the three-dimensional facial pose-dependent gaze direction estimation 955 handles the gaze direction estimation without explicitly estimating the eye gaze.
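
For illustration, the mode switching described above can be sketched as follows; this is a minimal Python sketch under stated assumptions, where the estimator callables, the eye crop being a pixel array, and the resolution threshold are illustrative and not specified by the invention.

    # Hypothetical sketch of the two gaze-direction estimation modes; the
    # estimator objects and the pixel threshold are illustrative only.
    MIN_EYE_PIXELS = 24  # assumed threshold separating "high-resolution" eyes

    def estimate_gaze_direction(eye_crop, facial_pose_3d,
                                eye_gaze_estimator, pose_dependent_estimator):
        eye_height_px = eye_crop.shape[0]
        if eye_height_px >= MIN_EYE_PIXELS:
            # Mode 1: explicit eye gaze from the localized iris/eye positions,
            # followed by the pose-dependent mapping to a gaze direction.
            eye_gaze = eye_gaze_estimator(eye_crop)
            return pose_dependent_estimator(eye_gaze, facial_pose_3d)
        # Mode 2: the eye image itself, together with the 3D facial pose,
        # is fed to machines trained directly on (eye image, pose) pairs.
        return pose_dependent_estimator(eye_crop, facial_pose_3d)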

In an exemplary embodiment where multiple cameras are employed, the best view is selected in the view selection 976 step, and the view that is determined to provide the most accurate gaze direction estimation 950 is fed to the three-dimensional facial pose-dependent gaze direction estimation 955. The correspondences between faces appearing in different face views are made based on the person position estimate 726 from the person position estimation 725 step.

FIG. 6 shows the steps in person position estimation 725. The body detection training 728 step generates the learning machines necessary for the body detection 720 step. The body detection training 728 utilizes the top-down view 347 images of people to train learning machines for the body detection 720 step. In operation, the top-down view 347 is first processed by foreground object segmentation 718 to identify regions where people's body images appear against the static background. This step serves both to limit the search space for body detection and to reduce false detections. The body detection 720 step then searches the foreground region for human bodies. The body tracking 721 step keeps the identities of the people so that the correspondences among body images can be established. The image to floor mapping 939 computed from the site calibration 935 step changes the image coordinates of the bodies into the world coordinates on the floor. In the top-down view 347 of a human body image, the position of the head (the measurement that is more relevant to the gaze target estimation 970) relative to the body center of the detected body depends on its position in the view. The view-based body blob estimation 733 step finds the shape and orientation of the blob according to the prior model of the orientation of the human figure at each floor position in the view. The estimated view-based body blob is used to accurately locate the head position 735.
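
A minimal sketch of these two operations is given below, assuming the image to floor mapping 939 is represented as a 3x3 homography and that the head offset follows a simple radial model relative to the camera nadir; both the homography form and the offset ratio are assumptions for illustration, not part of the disclosed calibration.

    import numpy as np

    def image_to_floor(point_px, H):
        """Map a top-down-view image point to floor (world) coordinates using
        a 3x3 homography H obtained from site calibration (illustrative form)."""
        p = H @ np.array([point_px[0], point_px[1], 1.0])
        return p[:2] / p[2]

    def head_position_from_blob(blob_center_px, principal_point_px,
                                head_offset_ratio=0.35):
        """Rough view-based head offset: in a top-down view the head projects
        away from the camera nadir along the radial direction; the offset
        ratio is an assumed model parameter, not from the source."""
        center = np.asarray(blob_center_px, float)
        radial = center - np.asarray(principal_point_px, float)
        norm = np.linalg.norm(radial)
        if norm < 1e-6:
            return center
        return center + head_offset_ratio * radial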

FIG. 7 shows one of the features of the site calibration 935 step where the target grid resolution 923 is being determined, in an exemplary embodiment of the present invention. The gaze direction estimation error distribution 953 is the spatial distribution of the gaze direction estimation error. The accuracy of the gaze target estimate should be derived from the gaze direction estimation error distribution 953, but it will also depend on the distance. When the typical distance between the viewer 705 and a visual target 920 (denoted by the lower rectangle in the figure) is small (denoted as d), the gaze target estimation error distribution 971 will have a narrow shape with less uncertainty. When another visual target 920 (denoted by the upper rectangle) is positioned at twice the distance (denoted as D), the gaze target estimation error distribution 971 will have a flat shape with more uncertainty. The distribution is used to determine the effective resolution, or the size of the target grid 922, of the visual target.
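
The relation between angular error, distance, and grid size can be sketched as below; this assumes the gaze direction error is roughly Gaussian with a given angular standard deviation and that a grid cell should span a fixed number of standard deviations on the target plane. The function name, the example numbers, and the two-sigma coverage are illustrative assumptions.

    import math

    def target_grid_resolution(gaze_error_deg, viewing_distance_m,
                               coverage_sigmas=2.0):
        """Approximate target grid cell size from the angular gaze-direction
        error (std. dev., degrees) and the typical viewer-to-target distance.
        Doubling the distance roughly doubles the achievable cell size."""
        sigma_rad = math.radians(gaze_error_deg)
        return 2.0 * coverage_sigmas * viewing_distance_m * math.tan(sigma_rad)

    print(target_grid_resolution(2.0, 1.0))  # ~0.14 m at distance d = 1 m
    print(target_grid_resolution(2.0, 2.0))  # ~0.28 m at distance D = 2 m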

FIG. 8 shows a series of facial image processing steps, from face detection 360, to two-dimensional facial pose estimation 380, and to facial feature localization 410. Any image-based face detection algorithm can be used to detect human faces from an input image frame 330. Typically, a machine learning-based face detection algorithm is employed. The face detection algorithm produces a face window 366 that corresponds to the location and the size of the detected face. The two-dimensional facial pose estimation 380 step estimates the two-dimensional pose of the face to normalize the face to a localized facial image 384, where each facial feature is approximately localized within a standard facial feature window 406. The facial feature localization 410 step then finds the accurate location of each facial feature to extract it in a facial feature window 403.

FIG. 9 shows a two-dimensional facial pose estimation training scheme 512 in an exemplary embodiment of the present invention. The training faces 882 are generated by applying a random perturbation of the position (xf, yf), the size sf, and the orientation of to each of the manually aligned faces. The ranges (or distribution) of the perturbation are chosen to be the same as the ranges (or distribution) of the actual geometric variation of the faces from the face detection. Given an input face, the machine having the inherent pose (x, y, s, o) is trained to output the likelihood of the given input face having that inherent pose. If the input training face has the pose (xf, yf, sf, of), then the target output is the Gaussian likelihood:

L = Exp(-(xf-x)*(xf-x)/kx - (yf-y)*(yf-y)/ky - (sf-s)*(sf-s)/ks - (of-o)*(of-o)/ko),

where kx, ky, ks, and ko are constants determined empirically. The figure also illustrates the response 813 profile that each machine is trained to learn. Each machine is trained to produce a peak for the faces having the corresponding two-dimensional pose, and to produce gradually lower values as the two-dimensional pose changes from the inherent two-dimensional pose of the machine. The figure is shown only for the two dimensions (s, o) = (scale, orientation) for the purpose of clear presentation.
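
The target likelihood above can be evaluated directly; the following Python sketch simply implements the stated formula (the function and argument names are illustrative).

    import math

    def target_likelihood(face_pose, machine_pose, kx, ky, ks, ko):
        """Training target for a machine with inherent pose (x, y, s, o):
        peaks at 1 when the training face pose matches the machine's inherent
        pose and falls off as the Gaussian-like function given above.
        The constants kx..ko are chosen empirically, as stated in the text."""
        xf, yf, sf, of = face_pose
        x, y, s, o = machine_pose
        return math.exp(-((xf - x) ** 2) / kx
                        - ((yf - y) ** 2) / ky
                        - ((sf - s) ** 2) / ks
                        - ((of - o) ** 2) / ko)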

FIG. 10 shows an exemplary sampling of (yaw, pitch) ranges for three-dimensional facial pose estimation 390 in an exemplary embodiment of the present invention. In one of the exemplary embodiments, each set (yaw, pitch) of geometric parameters is chosen by sampling from the ranges of possible values. The range is typically determined by the target pose ranges to be estimated. In the exemplary embodiment shown in the figure, the table shows such sampled pose bins, where each pose dimension is split into 5 pose bins.
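
A minimal sketch of such a sampling is given below; the numeric yaw and pitch ranges are assumptions chosen only to illustrate the 5 x 5 bin layout of the figure.

    import numpy as np

    # Illustrative (yaw, pitch) bin centers; the +/-45 and +/-30 degree ranges
    # are assumed, matching only the 5-bins-per-dimension layout of the figure.
    yaw_centers = np.linspace(-45.0, 45.0, 5)
    pitch_centers = np.linspace(-30.0, 30.0, 5)
    pose_bins = [(yw, pt) for yw in yaw_centers for pt in pitch_centers]
    print(len(pose_bins))  # 25 pose bins, one learning machine per bin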

FIG. 11 shows a three-dimensional facial pose estimation training scheme 830 in an exemplary embodiment of the present invention. The facial images used for training are typically normalized from the two-dimensional facial pose estimation 380, so that the system estimates the three-dimensional facial pose more efficiently. The figure illustrates the response 813 profile that each machine is trained to learn. Each machine is trained to produce a peak for the faces having the corresponding (yw, pt), and to produce gradually lower values as the (yw, pt) changes from the inherent (yw, pt) of the machine. The mathematical expression for the response 813 profile is very similar to the equation for the case of two-dimensional facial pose estimation training 820.

FIG. 12 shows a facial feature localization training scheme 840 in an exemplary embodiment of the present invention. This exemplary training scheme aims to estimate the x (horizontal) shift, the y (vertical) shift, the scale, and the orientation of the right eye within the standard facial feature window 406. Each eye image 421 is generated by cropping the standard facial feature window 406 of the right eye from the localized facial image 384. The facial landmark points of the face are assumed to be known, and the coordinates of the landmark points 657, after going through the localization based on the two-dimensional facial pose estimation 380 step, are available.

Given an input right eye image 421, the machine having the inherent geometry (x0, y0, s0, o0) is trained to output the likelihood of the eye image 421 having that inherent geometry. If the input training eye has the geometry (ex, ey, es, eo), then the target output is the Gaussian likelihood:

L = Exp(-(ex-x0)*(ex-x0)/kx - (ey-y0)*(ey-y0)/ky - (es-s0)*(es-s0)/ks - (eo-o0)*(eo-o0)/ko),

where kx, ky, ks, and ko are constants determined empirically. (ex, ey, es, eo) can easily be determined beforehand using the coordinates of the landmark points relative to the standard facial feature positions and sizes. Each plot in the figure illustrates the response 813 profile that each machine is trained to learn. Each machine is trained to produce a peak for the eye image 421 having the matching geometry, and to produce gradually lower values as the geometry changes from the inherent geometry of the machine. In this exemplary embodiment, multiple learning machines are employed to estimate the x-location and the scale of the right eye, where each machine is tuned to a specific (x-shift, scale) pair; the figure is illustrated only for the two dimensions (x, s) = (x-shift, scale) for the purpose of clear presentation.

FIG. 13 shows the facial feature localization 410 scheme in an exemplary embodiment of the present invention. The two-dimensional facial pose estimation 380 step and the three-dimensional facial pose estimation 390 step can be performed on a facial image in a manner similar to the facial feature localization. Once each facial feature tuned machine 844 has been trained to output the likelihood of the given facial feature having the predetermined pose vector (xi, yi, si, oi), an array of such learning machines can process any facial feature image to compute the likelihoods. In the figure, a given eye image 421 inside the standard facial feature window 406 is fed to the trained learning machines, and then each machine outputs its response 813 to the particular pose vector 462 (xi, yi, si, oi). The responses are then normalized 815 by dividing them by the sum of the responses to generate the weights 817. Each weight is then multiplied by the corresponding pose vector (xi, yi, si, oi). The pose vectors (x1, y1, s1, o1), . . . , (xN, yN, sN, oN) are weighted and added up to compute the estimated pose vector (x*, y*, s*, o*). The pose vector represents the difference in position, scale, and orientation that the given eye image 421 has against the standard eye position and size. The pose vector is used to correctly extract the facial features.
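
The normalize-and-average combination described above amounts to the following short sketch (function and variable names are illustrative).

    import numpy as np

    def estimate_pose_vector(responses, pose_vectors):
        """Combine per-machine responses into a single pose estimate:
        normalize the responses into weights, then take the weighted sum of
        the machines' inherent pose vectors (xi, yi, si, oi)."""
        responses = np.asarray(responses, dtype=float)       # shape (N,)
        pose_vectors = np.asarray(pose_vectors, dtype=float)  # shape (N, 4)
        weights = responses / responses.sum()
        return weights @ pose_vectors                         # (x*, y*, s*, o*)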

FIG. 14 shows the instances of different eye gaze 910. The columns represent different yaw (horizontal) directions of the eye gaze; the rows represent different pitch (vertical) directions of the eye gaze. Because each eye gaze 910 renders a unique appearance change of the eye, the image signature is used to estimate the eye gaze.

FIG. 15 shows an exemplary embodiment of the eye gaze annotation 962 step that is necessary for the eye gaze estimation training 964 step. In this embodiment, the human annotator determines the degree of confidence 963 of each of the determined eye gazes 910. The eye gaze annotation confidence 963 is introduced to deal with the eye gaze ambiguity that an eye image has; the ambiguity arises due to image resolution, the distance between the face and the camera, lighting conditions, eye size and shape, etc. In the figure, the eyes 3 428 image and the eyes 5 430 image have less confidence due to the small sizes of the eyes. The eyes 4 429 image has less confidence due to the low resolution. The learning machine trained using the annotated data can estimate both the eye gaze 910 and the level of confidence 963.

FIG. 16 shows a function of the three-dimensional facial pose-dependent gaze direction estimation 955 step. The eye gaze 910 estimated from the eye gaze estimation 960 step is manifested by the movement of the iris, and is independent of the three-dimensional facial pose. Therefore, two instances of the same eye gaze can actually point to different gaze targets 925 depending on the three-dimensional facial pose 391. In the figure, the top face shows the frontal pose, and the corresponding eye gaze reveals that the person is looking to the right. In this case, the eye gaze 910 is the only element needed to estimate the gaze direction 901. In the bottom face, the face is pointing to the right. The eye gaze 910 of the person appears to be very similar to the first face, but needs to be interpreted differently from the case of the top face; the three-dimensional facial pose 391 should be additively incorporated into the final estimate of the gaze direction. The way to combine the eye gaze 910 and the three-dimensional facial pose 391 can be learned using a learning machine-based method.

FIG. 17 shows an exemplary scheme of the three-dimensional facial pose-dependent gaze direction estimation 955 step. Each column represents a different three-dimensional facial pose 391 (different yaw angles), and each row represents a different eye gaze 910. The orientation of each gaze direction estimate 961 corresponding to the three-dimensional facial pose and eye gaze is illustrated using an arrow in a circle. The middle (third) column shows the frontal facial pose, where the horizontal position of the iris relative to the eye simply translates to the gaze direction. When the face is pointing to the right (first column), it gives an additive bias (to the right of the person) to the gaze direction estimate.

FIG. 18 shows an exemplary embodiment of the three-dimensional facial pose-dependent gaze direction estimation 955 step. In this embodiment, multiple learning machines are used, where each machine is a three-dimensional facial pose-dependent learning machine 958 that is trained for a particular three-dimensional facial pose 391. Once the eye gaze 910 and the three-dimensional facial pose 391 are estimated from the face view 342, they are fed to each of the machines. In one embodiment, only the machine whose pose range contains the estimated three-dimensional facial pose 391 is activated to estimate the gaze direction. In another embodiment, all the machines are activated, but the output gaze direction estimates are weighted to produce the final gaze direction estimate; the weight 816 is proportional to a measure of similarity between the estimated three-dimensional facial pose and the inherent pose of the machine. In the figure, the weight 816 is denoted by the thickness of the arrow.
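
A minimal sketch of the weighted-combination variant is given below. The invention states only that the weight is proportional to a pose similarity measure; the Gaussian kernel, its sigma, the machine callables, and the variable names are assumptions for illustration.

    import numpy as np

    def pose_dependent_gaze(eye_gaze, facial_pose_3d, machines, machine_poses,
                            sigma_deg=15.0):
        """Weighted combination of pose-dependent machines: each machine's
        gaze-direction output is weighted by the similarity of the estimated
        3D facial pose (yaw, pitch) to the machine's inherent pose."""
        poses = np.asarray(machine_poses, dtype=float)        # (N, 2) yaw/pitch
        d2 = np.sum((poses - np.asarray(facial_pose_3d)) ** 2, axis=1)
        weights = np.exp(-d2 / (2.0 * sigma_deg ** 2))        # assumed kernel
        weights /= weights.sum()
        outputs = np.array([m(eye_gaze, facial_pose_3d) for m in machines])
        return weights @ outputs                              # combined estimate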

In a scenario where the eye images are not large enough for a reliable eye gaze estimation, the normalized eye image 421 (instead of the estimated eye gaze) along with the three-dimensional facial pose is fed to the machines to estimate the gaze direction. In this embodiment, the machines are trained to process the normalized eye image 421 and the three-dimensional facial pose estimate 391, instead of being trained to process the eye gaze 910 and the three-dimensional facial pose estimate 391.

FIG. 19 shows the person position estimation 725 and gaze target estimation 970 steps in an exemplary embodiment of the present invention. The person position estimation 725 step provides the world coordinate of the person, more precisely, the position of the head 735. From the body detected by the body detection 720 and body tracking 721 steps, the view-based person blob model 732 is employed to estimate the body blob 730 in the view-based body blob estimation 733 step. The view-based person blob model 732 consists of an approximate shape of a body outline and the position of the head at each floor position. Once the body blob 730 is estimated, the head position 735 is located based on the model. The gaze direction estimation 950 step provides the gaze direction estimate 952, which is the orientation of the gaze direction relative to the face-view camera 110. The gaze target 925 depends on the face-view camera 110 orientation and the position of the person, as shown in the figure; first the gaze direction 901 is interpreted for each face-view camera 110 using the gaze to target grid mapping 974 to estimate the gaze line 904. The coordinate of the gaze target is estimated by finding the intersection of the gaze line 904 with the visual target plane 921. The gaze line 904 is a line originating from the person's position having the same orientation as the estimated gaze direction.
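
The intersection of the gaze line 904 with the visual target plane 921 is a standard ray-plane computation; the sketch below assumes all quantities are expressed in 3D world coordinates, with illustrative names.

    import numpy as np

    def gaze_target_on_plane(head_pos, gaze_dir, plane_point, plane_normal):
        """Intersect the gaze line (origin at the estimated head position,
        direction given by the estimated gaze direction) with the visual
        target plane; returns None if there is no valid intersection."""
        head_pos = np.asarray(head_pos, float)
        gaze_dir = np.asarray(gaze_dir, float)
        plane_normal = np.asarray(plane_normal, float)
        denom = gaze_dir @ plane_normal
        if abs(denom) < 1e-9:
            return None  # gaze line is parallel to the target plane
        t = ((np.asarray(plane_point, float) - head_pos) @ plane_normal) / denom
        if t < 0:
            return None  # the person is looking away from the plane
        return head_pos + t * gaze_dir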

FIG. 20 shows the view selection 976 scheme in an exemplary embodiment of the present invention. When the system employs multiple face-view cameras, one person's face can appear in more than one face view 342. Because the accuracy of the gaze target estimation 970 greatly depends on the accuracy of the eye gaze estimation 960, it is crucial to choose the view that provides a better view of the face, more specifically, of the eyes. The step chooses the best view of the face based on the person position 724 and the three-dimensional facial pose 391 of the person, because both the distance between the face and the camera and the three-dimensional facial pose relative to the camera affect the view. In the figure, the face of person 1 701 may appear in both the face view 1 343 and the face view 2 344. The face view 1 343 is the selected view 977 for person 1 701 based on the close distance to the face-view camera 1 111. The face of person 2 702 can appear in both the face view 2 344 and the face view 3 345. The person is slightly closer to the face-view camera 3 113, but is facing the face-view camera 2 112. Therefore, the face view 2 344 is the selected view 977. The correspondence of a person appearing in different views is made based on the person position estimate; the person position estimate also provides the distance of the person to each of the cameras.
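
One way to realize this selection is a simple score combining frontality and proximity; the invention does not specify a formula, so the weights, field names, and scoring function below are purely illustrative assumptions.

    def select_face_view(views):
        """Hypothetical view-selection scoring: prefer views where the face is
        close to the camera and facing it. Each entry of `views` is a dict
        with 'distance_m' and 'yaw_to_camera_deg' (0 = frontal)."""
        def score(v):
            frontality = max(0.0, 1.0 - abs(v['yaw_to_camera_deg']) / 90.0)
            proximity = 1.0 / max(v['distance_m'], 0.1)
            return 0.7 * frontality + 0.3 * proximity  # assumed weights
        return max(views, key=score)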

FIG. 21 shows an estimated gaze map 980. The darker regions represent areas receiving more frequent gaze. The gaze map can reveal how much a certain region in the visual target receives people's attention.

FIG. 22 shows an exemplary embodiment of the weighted voting scheme for gaze map estimation 985. The scheme serves to address the issue of the varying degree of confidence of the gaze target estimates 972. The level of confidence 963 corresponding to each eye gaze estimate 961 computed from the eye gaze estimation 960 step naturally translates into the confidence level for the gaze target estimate 972. In the figure, the gaze target estimation 970 step computed a higher confidence value for the eye gaze estimate 961 of the left eye image than for the right eye image. In this embodiment, the higher-confidence gaze target estimate contributes more to the gaze map 980 in the form of a narrow Gaussian-like function 990. The lower-confidence gaze target estimate contributes less to the gaze map 980 in the form of a flat Gaussian-like function 990. The degree of confidence is reflected not only in the shape of the function, but also in the total weight of the contribution.
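
A minimal sketch of such a weighted vote is given below; the invention describes narrow, heavy kernels for high confidence and flat, light kernels for low confidence, while the specific sigma mapping and normalization here are assumptions.

    import numpy as np

    def accumulate_gaze_map(gaze_map, target_cell_xy, confidence,
                            base_sigma_cells=1.0, max_sigma_cells=4.0):
        """Add one gaze target estimate to the gaze map as a 2D Gaussian-like
        vote: higher confidence gives a narrower, heavier kernel; lower
        confidence gives a flatter, lighter one (parameter choices assumed)."""
        h, w = gaze_map.shape
        sigma = base_sigma_cells + (1.0 - confidence) * (max_sigma_cells
                                                         - base_sigma_cells)
        ys, xs = np.mgrid[0:h, 0:w]
        d2 = (xs - target_cell_xy[0]) ** 2 + (ys - target_cell_xy[1]) ** 2
        kernel = np.exp(-d2 / (2.0 * sigma ** 2))
        gaze_map += confidence * kernel / kernel.sum()
        return gaze_map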

FIG. 23 shows an estimated gaze trajectory 982 of a single viewer 705. The plot can reveal how the interest of the viewer 705 changes over the span of viewing.

While the above description contains many specificities, these should not be construed as limitations on the scope of the invention, but as exemplifications of the presently preferred embodiments thereof. Many other ramifications and variations are possible within the teachings of the invention. Thus, the scope of the invention should be determined by the appended claims and their legal equivalents, and not by the examples given.

1. A method for estimating a gaze target within a visual target that a person is looking at, based on automatic image measurements, comprising the following steps of: a) processing calibrations for at least a first means for capturing images for face-view and at least a second means for capturing images for top-down view, b) determining a target grid of the visual target, c) detecting and tracking a face of the person from first input images captured by the first means for capturing images, d) estimating a two-dimensional pose and a three-dimensional pose of the face, e) localizing facial features to extract an eye image of the face, f) estimating eye gaze of the person and estimating gaze direction of the person based on the estimated eye gaze and the three-dimensional facial pose of the person, g) detecting and tracking the person from second input images captured by the second means for capturing images, h) estimating a head position using the top-down view calibration, and i) estimating the gaze target of the person from the estimated gaze direction and the head position of the person using the face-view calibration.
2. The method according to claim 1, wherein the method further comprises a step of taking geometric measurements of the site and the visual target to come up with specifications and the calibrations for the means for capturing images.
3. The method according to claim 1, wherein the method further comprises steps of: a) estimating a gaze direction estimation error distribution, and b) determining the target grid based on the gaze direction estimation error distribution and average distance between the person and the visual target.
4. The method according to claim 1, wherein the method further comprises a step of determining a mapping from the estimated head position and the estimated gaze direction to the target grid.
5. The method according to claim 1, wherein the method further comprises a step of determining the mapping from the second input image coordinate to the floor coordinate, based on the position and orientation of the first means for capturing images.
6. The method according to claim 1, wherein the method further comprises a step of training a plurality of first machines for estimating the three-dimensional pose of the face.
7. The method according to claim 1, wherein the method further comprises a step of training a plurality of second machines for estimating the two-dimensional pose of the face.
8. The method according to claim 1, wherein the method further comprises a step of training a plurality of third machines for localizing each facial feature of the face.
9. The method according to claim 1, wherein the method further comprises a step of training at least a fourth machine for estimating the eye gaze from the eye image.
10. The method according to claim 9, wherein the method further comprises a step of annotating the eye images with both the eye gaze and a confidence level of the eye gaze annotation.
11. The method according to claim 10, wherein the method further comprises a step of training the fourth machine so that the machine outputs both the eye gaze and the confidence level of the eye gaze estimate.
12. The method according to claim 1, wherein the method further comprises a step of training at least a fifth machine for estimating the gaze direction.
13. The method according to claim 12, wherein the method further comprises a step of training the fifth machine for estimating the gaze direction from the eye gaze and the three-dimensional facial pose.
14. The method according to claim 12, wherein the method further comprises a step of employing the fifth machine for estimating the gaze direction from the eye image and the three-dimensional facial pose.
15. The method according to claim 12, wherein the method further comprises a step of training the fifth machine so that the machine outputs both the gaze direction and the confidence level of the gaze direction estimate.
16. The method according to claim 15, wherein the method further comprises a step of estimating a gaze map by weighting each of the gaze target estimates with the confidence levels corresponding to the gaze direction estimates.
17. The method according to claim 1, wherein the method further comprises a step of selecting a stream of first input images among a plurality of streams of first input images when the person's face appears in more than one stream of first input images, based on the person's distance to each of the plurality of first means for capturing images and the three-dimensional facial poses relative to each of the plurality of first means for capturing images.
18. The method according to claim 1, wherein the method further comprises a step of utilizing a view-based body blob model to estimate the head position of the person.
19. The method according to claim 1, wherein the method further comprises a step of constructing a gaze trajectory and a gaze map based on the estimated gaze target.
20. An apparatus for estimating a gaze target within a visual target that a person is looking at, based on automatic image measurements, comprising: a) means for processing calibrations for at least a first means for capturing images for face-view and at least a second means for capturing images for top-down view, b) means for determining a target grid of the visual target, c) means for detecting and tracking a face of the person from first input images captured by the first means for capturing images, d) means for estimating a two-dimensional pose and a three-dimensional pose of the face, e) means for localizing facial features to extract an eye image of the face, f) means for estimating eye gaze of the person and estimating gaze direction of the person based on the estimated eye gaze and the three-dimensional facial pose of the person, g) means for detecting and tracking the person from second input images captured by the second means for capturing images, h) means for estimating a head position using the top-down view calibration, and i) means for estimating the gaze target of the person from the estimated gaze direction and the head position of the person using the face-view calibration.
21. The apparatus according to claim 20, wherein the apparatus further comprises means for taking geometric measurements of the site and the visual target to come up with specifications and the calibrations for the means for capturing images.
22. The apparatus according to claim 20, wherein the apparatus further comprises: a) means for estimating a gaze direction estimation error distribution, and b) means for determining the target grid based on the gaze direction estimation error distribution and average distance between the person and the visual target.
23. The apparatus according to claim 20, wherein the apparatus further comprises means for determining a mapping from the estimated head position and the estimated gaze direction to the target grid.
24. The apparatus according to claim 20, wherein the apparatus further comprises means for determining the mapping from the second input image coordinate to the floor coordinate, based on the position and orientation of the first means for capturing images.
25. The apparatus according to claim 20, wherein the apparatus further comprises means for training a plurality of first machines for estimating the three-dimensional pose of the face.
26. The apparatus according to claim 20, wherein the apparatus further comprises means for training a plurality of second machines for estimating the two-dimensional pose of the face.
27. The apparatus according to claim 20, wherein the apparatus further comprises means for training a plurality of third machines for localizing each facial feature of the face.
28. The apparatus according to claim 20, wherein the apparatus further comprises means for training at least a fourth machine for estimating the eye gaze from the eye image.
29. The apparatus according to claim 28, wherein the apparatus further comprises means for annotating the eye images with both the eye gaze and a confidence level of the eye gaze annotation.
30. The apparatus according to claim 29, wherein the apparatus further comprises means for training the fourth machine so that the machine outputs both the eye gaze and the confidence level of the eye gaze estimate.
31. The apparatus according to claim 20, wherein the apparatus further comprises means for training at least a fifth machine for estimating the gaze direction.
32. The apparatus according to claim 31, wherein the apparatus further comprises means for training the fifth machine for estimating the gaze direction from the eye gaze and the three-dimensional facial pose.
33. The apparatus according to claim 31, wherein the apparatus further comprises means for employing the fifth machine for estimating the gaze direction from the eye image and the three-dimensional facial pose.
34. The apparatus according to claim 31, wherein the apparatus further comprises means for training the fifth machine so that the machine outputs both the gaze direction and the confidence level of the gaze direction estimate.
35. The apparatus according to claim 34, wherein the apparatus further comprises means for estimating a gaze map by weighting each of the gaze target estimates with the confidence levels corresponding to the gaze direction estimates.
36. The apparatus according to claim 20, wherein the apparatus further comprises means for selecting a stream of first input images among a plurality of streams of first input images when the person's face appears in more than one stream of first input images, based on the person's distance to each of the plurality of first means for capturing images and the three-dimensional facial poses relative to each of the plurality of first means for capturing images.
37. The apparatus according to claim 20, wherein the apparatus further comprises means for utilizing a view-based body blob model to estimate the head position of the person.
38. The apparatus according to claim 20, wherein the apparatus further comprises means for constructing a gaze trajectory and a gaze map based on the estimated gaze target.